The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function 
through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the 
creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence- 
based analysis. Currently, the GOC disseminates 126 million annotations covering >374000 species including all the king- 
doms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators 
reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those 
generated computationally via automated methods. As manual annotations are often used to propagate functional 
predictions between related proteins within and between genomes, it is critical to provide accurate, consistent manual 
annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotations. This 
guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We 
hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of 
GO annotations available to all. 

Introduction 

The Gene Ontology Consortium (GOC; http://www.geneontology.org) 
is a bioinformatics resource that serves as a 
comprehensive repository of functional information about 
gene products assembled through the use of domain- 
specific ontologies (1). The project is a collaborative effort 
working to describe how and where gene products act by 
creating evidence-supported gene-product annotations to 
structured comprehensive controlled vocabularies. The 
Gene Ontology (GO) is a controlled vocabulary composed 
of >38 000 precisely defined phrases called GO terms that 
describe the molecular actions of gene products, the 



biological processes in which those actions occur and the 
cellular locations where they are present. First developed in 
1998 (2), the GOC project has grown to become an inte- 
grated resource providing functional information for a 
wide variety of species. As of January 2013, there are 
>126 million annotations to >19 million gene products 
from species throughout the tree of life. Of these there 
are 1.1 million manually curated annotations, from pub- 
lished experimental results, to 234000 gene products. As 
the GOC develops the standard language to describe func- 
tion, it also defines standards for using these ontologies in 
the creation of annotations. This article elaborates on the 
methods and conventions adopted by the GOC curation 
teams for constructing annotations and serves as a guide to 
new or potential annotators, and the biological community 
at large, for understanding the requirements necessary to 
create and maintain the highest quality GO annotations. 

Overview of GO annotations 

The goal of the GOC is the unification of biology by creat- 
ing a nomenclature used for describing the functional char- 
acteristics of any gene product, protein or RNA, from any 
organism. There are two parts to a GO annotation: first, the 
association asserted between a gene product and a GO def- 
inition; and second, the source (e.g. published article) and 
evidence used as the authority to make the assertion. The 
GO is a set of highly structured directed acyclic graphs 
(DAGs); its structure and content have been extensively 
described elsewhere (2, 3). Here, we limit our presentation 
to the GO term name (the phrase that is typically used 
when discussing individual components of the ontologies, 
often shortened to 'GO term'), the GO definition (the text 
string that explains the precise meaning of the GO term) 
and a numerical identifier called the GOID (examples used 
in this guide are shown in Table 1). In addition, each term 
can have multiple ontological relationships to broader 
(parent) and more specific (child) terms (Figure 1 illustrates 
how terms and relationships are represented in GO). 
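
The parent and child relationships can be made concrete with a small sketch. The fragment below encodes only the 'is_a' chain shown in Figure 1 and walks it upward to collect every broader term. The dictionary layout and the function name `ancestors` are assumptions made for illustration; they are not part of any GOC tool.

```python
# Toy fragment of the Molecular Function ontology, taken from Figure 1.
# Each term maps to its is_a parents; real GO terms may have several parents.
IS_A = {
    "GO:0004463": ["GO:0016803"],  # leukotriene-A4 hydrolase activity
    "GO:0016803": ["GO:0016801"],  # ether hydrolase activity
    "GO:0016801": ["GO:0016787"],  # hydrolase activity, acting on ether bonds
    "GO:0016787": ["GO:0003824"],  # hydrolase activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity
    "GO:0003674": [],              # molecular_function (root node)
}


def ancestors(goid, is_a=IS_A):
    """Return every term reachable from goid by following is_a edges."""
    seen = set()
    stack = list(is_a.get(goid, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


broader_terms = ancestors("GO:0004463")
```

Because the traversal keeps a set of visited terms, it also works on a graph in which a term has more than one parent, which is the general case in a DAG.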

Although annotations are typically viewed as connec- 
tions between a gene product and a GO term, it is import- 
ant to stress that the GO term name is a surrogate for the 
definition, and that the biological concept described by the 
definition is really the core assertion being made by an 
annotation. This is a subtle yet important point central to 
understanding the power of the GO, one that is not always 
appreciated by either annotators or consumers of GO 
annotations. As with a spoken language, understanding 
its usage depends on shared definitions of its terms 
and phrases. Thus, annotating to the definition is 
required to avoid confusion when the names of 
biological concepts or the terminology used in the published 
literature are ambiguous. 

The source of the information used to make an anno- 
tation includes both a specific reference, usually a pub- 
lished scientific article represented by a PubMed identifier 
(PMID), that describes the result of an experimental or 
computational analysis on which the association was 
based, and an evidence code (Table 2) that reflects the 
type of experimental assay or analysis that supports the 
association. Annotations can be asserted manually from 
the literature by biocurators or computationally by auto- 
mated methods. This article will focus on standards defined 
by the GOC for manual curation. Computational annota- 
tion methods and their guidelines have been reported 
elsewhere (4). 



Annotation format 

GO annotations are recorded and supplied in a standard 
tab-delimited file format called the Gene Association 
File (GAF, http://www.geneontology.org/GO.format.annotation.shtml). 
For each annotation, the GAF format contains both required 
and optional fields, some of which will be discussed below. 
The required fields are: the identifier of the gene product 
being annotated, the GOID of the GO term associated with 
the gene product, an evidence code, the reference (either a 
published article or a GOC-specific internal reference) 
supporting the use of the GOID, the aspect of the ontology 
(Molecular Function, Biological Process or Cellular 
Component), the curation project that created the annotation, 
the object type that is being annotated (see below), the 
NCBI taxonomy database identifier for the species of the 
gene product and the date the annotation was created or 
modified. A sample annotation is 
shown in Table 3. 
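
As a rough illustration of the format, the sketch below splits one tab-delimited GAF 2.0 row into named fields and checks that the required columns are filled. The column order and the required/optional status follow Table 3, and the sample values are copied from that table. The dictionary keys and the function name `parse_gaf_line` are hypothetical names chosen for this example, not an official schema or GOC software.

```python
from typing import Dict

# Column order follows Table 3 (GAF 2.0). The keys are illustrative names.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]

# Columns marked "Required" in Table 3.
REQUIRED = {
    "db", "db_object_id", "db_object_symbol", "go_id", "db_reference",
    "evidence_code", "aspect", "db_object_type", "taxon", "date",
    "assigned_by",
}


def parse_gaf_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited GAF 2.0 row into a dict keyed by column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > len(GAF_COLUMNS):
        raise ValueError(f"expected at most {len(GAF_COLUMNS)} columns")
    # Trailing optional columns may be absent; pad them with empty strings.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record = dict(zip(GAF_COLUMNS, fields))
    missing = sorted(name for name in REQUIRED if not record[name])
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return record


# The sample annotation from Table 3, joined into a single GAF row.
SAMPLE_LINE = "\t".join([
    "MGI", "MGI:1350922", "Cadps", "NOT", "GO:0006887",
    "MGI:MGI:3583730|PMID:15820695", "IMP", "MGI:MGI:3583931", "P",
    "Ca2+-dependent secretion activator", "CAPS1", "Protein",
    "Taxon:10090", "20060202", "MGI",
    "occurs_in(CL:0000001)|occurs_in(CL:0000336)", "UniProtKB:Q80TJ1",
])

record = parse_gaf_line(SAMPLE_LINE)
references = record["db_reference"].split("|")  # pipe separates multiple values
```

Fields that allow several values, such as the reference column, use the pipe character as a separator, so they are split in a second step rather than during tab parsing.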

Manual curation 

Within the GOC, manual annotations are made by experi- 
enced biocurators from a variety of annotation projects 
including, but not limited to, the Saccharomyces Genome 
Database [SGD, (6)], Mouse Genome Informatics [MGI, (7)], 
WormBase (8), PomBase (9), FlyBase (10), ZFIN (11) and 
UniProt (12). Manual curation typically encompasses two 
approaches. The first involves reading relevant publications, 
identifying the gene product(s) of interest, and ascribing 
the reported experimental results to a GO definition 
using an appropriate evidence code (Table 2). The second 
involves inferring a gene's role by manual examination of 
its nucleic acid or protein sequence motifs, structure or 
phylogenetic relationships. For consistent interpretation 
of experimental results and sequence analysis, the GOC 
has established annotation guidelines that are elaborated 
below. These guidelines were developed by GOC member 
projects (http://www.geneontology.org/GO.consortiumlist.shtml), 
with assistance from other groups engaged in representing 
biological function in a straightforward but precisely 
defined form. Over time these guidelines have evolved into 
required standards for all manual annotations and have 
been incorporated into validation tools used by the GOC 
to maintain annotation quality and uniformity. 

Gene product: Object of 
annotation 

The annotation objects, or molecular entities, are those defined 
by the Sequence Ontology [(13), http://www.sequenceontology.org] 
and include complex, gene, gene_product, miRNA, ncRNA, 
protein, protein_complex, protein_structure, RNA, rRNA, 
snoRNA, snRNA, transcript, tRNA and polypeptide. While 
annotations are typically created for chromosomal features, 
such as a gene for its protein or ncRNA product, other types 
of objects can be annotated, including groups of gene 
products that make a complex. The annotation object can be 
associated to a GO term from one or more of the three 
aspects of the GO (Molecular Function, Biological Process 
and Cellular Component). A gene product is the most common 
object of annotation, and all such objects require a stable 
identifier such as those specified by sequence databases 
maintained by the European Bioinformatics Institute (EBI) 
and the National Center for Biotechnology Information 
(NCBI). Model Organism Databases (MODs) also maintain 
unique identifiers that often represent specific types of 
molecular entities, such as RNA transcripts, that do not 
have an identifier from one of the archival repositories. 

Figure 1. GO term 'leukotriene-A4 hydrolase activity' [GO:0004463], one of the terms mentioned in the main text of the article, 
as seen in AmiGO (16, http://amigo.geneontology.org). (a) Graphical view of the ontology structure showing the most granular 
term 'leukotriene-A4 hydrolase activity' [GO:0004463] at the bottom (highlighted in red), and all its parent terms leading up to 
the root node ('molecular_function' [GO:0003674]) at the top. Each box representing a GO term includes the GO identifier, and 
the blue lines connecting the terms represent the ontological relationship 'is_a' (implying that a child term is a subtype of the 
parent term). (b) Alternate text display for viewing the ontology structure. 'leukotriene-A4 hydrolase activity' [GO:0004463] is 
highlighted in red. Each child term is indented from its parent to indicate the depth of the tree. Apart from the GOID and GO 
term, each row includes other pieces of information that are important to understand the ontology and the annotations to each 
term. Starting from the left end of the row, the + sign indicates that there are child terms for that node and clicking on the + sign 
opens the browser to display the child terms. Next, the small icon 'i' indicates the term is related to its parent by an is_a 
relationship (explained above). At the right end of the row in brackets is the total number of gene products annotated to 
that term and all its child terms. (c) Term information relevant to making an annotation is highlighted in red, which includes the 
GOID, Aspect of the ontology (Molecular Function), Synonyms and Definition of the term. 

Panels (a) and (b), ontology structure: 

all : all [649919 gene products] 
  GO:0003674 : molecular_function [528279 gene products] 
    GO:0003824 : catalytic activity [207669 gene products] 
      GO:0016787 : hydrolase activity [75465 gene products] 
        GO:0016801 : hydrolase activity, acting on ether bonds [263 gene products] 
          GO:0016803 : ether hydrolase activity [143 gene products] 
            GO:0004463 : leukotriene-A4 hydrolase activity [30 gene products] 

Panel (c), synonyms and definition: 

exact: (7E,9E,11Z,14Z)-(5S,6S)-5,6-epoxyicosa-7,9,11,14-tetraenoate hydrolase activity 
exact: leukotriene A(4) hydrolase activity 
exact: leukotriene A4 hydrolase activity 
exact: LTA-4 hydrolase activity 
exact: LTA4 hydrolase activity 
related: LTA4H 

Catalysis of the reaction: H(2)O + leukotriene A(4) = leukotriene B(4). 

Approaching an article for curation 

When experimental data on a gene product has been pub- 
lished, the following guidelines can be used to identify the 
relevant or annotatable pieces of information that may 
generate GO annotation for that gene product. 

Table 3. A sample annotation in the GAF 2.0 format 

Column  Content                        Required?  Example 
1       DB                             Required   MGI 
2       DB Object ID                   Required   MGI:1350922 
3       DB Object Symbol               Required   Cadps 
4       Qualifier                      Optional   NOT 
5       GO ID                          Required   GO:0006887 
6       DB:Reference (|DB:Reference)   Required   MGI:MGI:3583730|PMID:15820695 
7       Evidence Code                  Required   IMP 
8       With (or) From                 Optional   MGI:MGI:3583931 
9       Aspect                         Required   P 
10      DB Object Name                 Optional   Ca2+-dependent secretion activator 
11      DB Object Synonym (|Synonym)   Optional   CAPS1 
12      DB Object Type                 Required   Protein 
13      Taxon(|taxon)                  Required   Taxon:10090 
14      Date                           Required   20060202 
15      Assigned By                    Required   MGI 
16      Annotation Extension           Optional   occurs_in(CL:0000001)|occurs_in(CL:0000336) 
17      Gene Product Form ID           Optional   UniProtKB:Q80TJ1 

This table provides an example of an annotation from the Mouse Genome Informatics group (from February 2013). The Cadps protein 
(MGI identifier MGI:1350922) was annotated by the MGI project to 'exocytosis' [GO:0006887], a term in the Biological Process ontology 
indicated by 'P' in column 9. This annotation used the 'NOT' qualifier, indicating that the authors of PMID:15820695 (5) showed that this 
protein is 'NOT' involved in 'exocytosis'. The non-PMID reference number, MGI:MGI:3583730, is MGI's internal identifier for the same 
reference. The curators arrived at this annotation based on the phenotype of the Cadps mutant, which is indicated with the IMP 
evidence code. The identifier of the allele (MGI:MGI:3583931) used in the experiment is captured in column 8 (WITH/FROM). In addition, 
the annotation extension field (column 16) indicates the cell types (CL:0000001, primary cell culture or CL:0000336, 
adrenal medulla chromaffin cell) in which this protein was NOT found to be involved in this process (exocytosis). Finally, the last column gives the 
UniProtKB identifier for the isoform of the mouse Cadps protein that was studied. 

(i) Identification of relevant articles describing a gene 
product's function is the essential starting point for 
making annotations. While PubMed is a typical starting 
point for finding relevant articles, research in the 
area of Natural Language Processing (NLP) provides 
additional methods that can aid in the search for 
curatable articles. More on NLP methods used for 
biocuration can be found in the reports from the 
BioCreative workshops (14). Once an article has been 
identified, biocurators must properly specify the 
objects of annotation, including confirmation of the 
correct taxa. These details are often found in the 
Methods section of the article, but unambiguously 
determining the species for annotation can be 
problematic, particularly in vertebrate systems where 
orthologous gene names are shared among taxa. Further, 
when multiple model organism systems are being 
used simultaneously, the taxon of the genes being 
investigated is not always specifically designated. 
For example, Lin and Isaacson (15) studied axonal 
growth regulation by netrin and slit proteins using 
both mouse and rat cells. Two of the plasmids 
containing slit coding sequences were acknowledged as 
gifts and no reference to the species of origin was 
provided. In this case, to determine the species the 
sequences represent, the authors had to be contacted; 
they confirmed that the sequences actually originated 
from human, not from mouse or rat. 

(ii) The Introduction section of the article will often pre- 
sent previous knowledge about the gene product's 
function. If citations to original works are included 
then the article can be used as a source of the infor- 
mation and annotated using the evidence code (see 
below) Traceable Author Statement (TAS). The use of 
TAS evidence has decreased over time, as it is best 
practice to go to the original article to capture the 
annotation directly from experimental results. This 
allows for clear attribution of an annotation to the 
original experimental details. Thus, the GOC strongly 
discourages the continued use of TAS and recommends 
replacing existing TAS annotations with those to the 
published experimental results. 

(iii) Annotations derived from experimental data are 
most often found in the Methods and Results sections 
or in the figure legends of articles. A biocurator can 
efficiently gain an overview of the biological 
context of the article from the Introduction section 
and then, using the experimental data in the 
Results section, create annotations with appropriate 
supporting experimental evidence. 

(iv) Authors often speculate on the role of the gene 
product in the Discussion section based on the experi- 
mental results they present. The authors may propose 
a hypothesis that combines previous knowledge, new 
findings from the current study and new ideas that 
have not yet been experimentally verified. This infor- 
mation is not suitable for an annotation assertion and 
if used to create an annotation can be detrimental, as 
these hypotheses have not been validated. 



Manual curation using sequence 
similarity data 

Manual curation by biocurators includes the in silico 
analysis of chromosomal features to infer a gene product's role 
and location. GO terms can be assigned to gene products 
on the basis of sequence similarity using the evidence code 
'Inferred from Sequence or structural Similarity' (ISS) with a 
custom reference, GO_Reference (GO_REF:0000024), as 
described in the next section. Potential homologs are ini- 
tially identified using sequence similarity search programs 
such as BLAST. The significance of the sequence similarity is 
then verified manually using a combination of sequence 
resources and analysis tools, including phylogenetic and 
comparative genomics databases such as Ensembl 
Compara (16), INPARANOID (17) and OrthoMCL (18). In all 
cases, biocurators validate each alignment to assess 
whether similarity is appropriate to infer the gene prod- 
uct's function. While there is no universal definition for 
the minimum requirements for similarity results, the signifi- 
cance of a match is judged on a case-by-case basis by the 
biocurator's expertise. Although the similarity criteria 
required to make these annotations are defined by the 
annotating group, the GOC has established several rules 
for making these assignments. They are as follows: 

(i) Mandatory inclusion of a stable database identifier 
that identifies the similar gene/gene product in the 
'WITH/FROM' field (column 8 in Table 3) 

(ii) The similar gene must be experimentally character- 
ized; to avoid circular inferences, the GO term 
should only be assigned if the similar gene/gene 
product is, or can be annotated, with the same 
term (or a more specific child term) using an 
experimental evidence code (e.g. Inferred from Direct 
Assay, IDA; Inferred from Mutant Phenotype, IMP; Inferred 
from Genetic Interaction, IGI; Inferred from Physical 
Interaction, IPI; Inferred from Expression Pattern, IEP). 



Annotations made with the NOT qualifier should not 
be transferred. 
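
The rules above lend themselves to an automated check. The following is a minimal sketch, assuming annotations are held as plain dictionaries and that a caller supplies a function returning the ancestors of a GO term; the function name `iss_problems`, the dictionary keys and the calling convention are all hypothetical. It only inspects annotations that already exist for the similar gene product, so it cannot cover the case where that gene product merely could be annotated with the term.

```python
# Experimental evidence codes listed in rule (ii).
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def iss_problems(iss_annotation, source_annotations, ancestors_of):
    """Return a list of rule violations for one ISS annotation.

    iss_annotation: dict with 'go_id' and 'with_from' for the new annotation.
    source_annotations: annotations of the gene product named in WITH/FROM,
        each a dict with 'go_id', 'evidence_code' and optional 'qualifier'.
    ancestors_of: callable mapping a GO ID to the set of its ancestor GO IDs.
    """
    problems = []
    if not iss_annotation.get("with_from"):
        problems.append("WITH/FROM identifier is missing")

    target = iss_annotation["go_id"]
    supported = False
    for ann in source_annotations:
        # Annotations carrying the NOT qualifier are never transferred.
        if "NOT" in ann.get("qualifier", "").split("|"):
            continue
        if ann["evidence_code"] not in EXPERIMENTAL_CODES:
            continue
        # Same term, or a more specific child term of the target.
        if ann["go_id"] == target or target in ancestors_of(ann["go_id"]):
            supported = True
            break
    if not supported:
        problems.append("no experimental annotation to the same or a more specific term")
    return problems
```

An empty list means the annotation passes both checks; the significance of the sequence similarity itself remains a matter for the biocurator's judgment.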

Sequence characteristics can be used to infer GO anno- 
tations for all three aspects of the ontology. However, care 
should be taken when transferring biological process anno- 
tations, as cellular processes and metabolic processes, for 
example, may be more readily inferred from sequence 
similarity than developmental processes, which may be 
species- or clade-specific. 

Use of GO reference 

As mentioned above, manual curation does not always re- 
quire a published reference to indicate the source of evi- 
dence. Annotations can be inferred by biocurators by 
analysis of the gene sequence or by combining direct 
experimental evidence from multiple sources. In these situ- 
ations, the citation is to a custom reference. These so-called 
GO references describe the methods and procedures used 
in creating such annotations. For example, GO_REF:0000024 
(http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000024), 
titled 'Manual transfer of experimentally verified manual 
GO annotation data to orthologs by curator judgment of 
sequence similarity', was created to describe the transfer 
of manual annotations to orthologs by curator judgment; 
the resulting annotations carry the ISS code. A 
second example is GO_REF:0000036 
(http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000036), 
'Manual annotations that require more than one source of 
functional data to support the assignment of the associated 
GO term.' This GO reference is used with the Inferred by 
Curator (IC) evidence code, described below. GO references 
are created and published on the GOC Web site 
(http://www.geneontology.org/cgi-bin/references.cgi) only once 
the biocurators agree on the content of the abstract and 
its usage. 
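
One way to picture the role of GO references is as a fallback citation when no published article applies. The sketch below pairs the two evidence codes named in this section with their GO references; the function name `reference_for` and the rule that a PMID always takes precedence are assumptions of this example rather than a stated GOC policy.

```python
# GO references named in the text, keyed by the evidence code they accompany.
CURATOR_GO_REFS = {
    "ISS": "GO_REF:0000024",  # curator-judged transfer by sequence similarity
    "IC": "GO_REF:0000036",   # inference from more than one source of data
}


def reference_for(evidence_code, pmid=None):
    """Pick the citation for an annotation: a PMID if given, else a GO_REF."""
    if pmid is not None:
        return f"PMID:{pmid}"
    if evidence_code in CURATOR_GO_REFS:
        return CURATOR_GO_REFS[evidence_code]
    raise ValueError(f"no GO reference known for evidence code {evidence_code!r}")
```

Raising an error for any other code without a PMID reflects the idea that experimental annotations should always point to the article reporting the result.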

How to define an annotation? 

Once literature relevant to a gene product has been iden- 
tified, the following guidelines can be used to decide which 
GO term(s) and evidence code(s) should be associated. 
Individual articles may not provide results that support an- 
notations for all three aspects of the ontology; thus, anno- 
tations to the different aspects will generally need to come 
from different articles. It is also common for a single 
article to yield multiple annotations for one 
aspect, at different levels of granularity in 
the same branch of the ontology. The granularity of the GO 
term selected depends heavily on the type of experiments 
being reported as well as the ability of the biocurator to 
understand the limitations of that experimental method. 
MacCullen (19) interviewed biocurators from the GOC in 
an effort to correlate the curator's education, work 
experience and research experience to measured variability in 
annotation. After observing there was significant variability in 
a test set of annotations, he explored possible causes. 
MacCullen reported no correlation between the amount 
of variation and any specific characteristic of the biocura- 
tor's education or experience and suggested that biocura- 
tors should continually work to coordinate annotation 
methods with the goal of minimizing variation. The solu- 
tion used by the Consortium's member projects is to have 
continuing education and discussions between biocurators 
to reduce variability that arises from inconsistent use of the 
rules and misunderstanding of the ontology terms. Also to 
further address the variability in the interpretation by bio- 
curators, the GOC holds regular controlled annotation ex- 
ercises to define standards and maintain consistent 
procedures. These exercises are conducted within and 
across most projects where biocurators annotate the same 
article or a small set of articles and then compare their an- 
notations. A discussion follows where the GOC comes to a 
consensus about the most appropriate annotations for that 
article and in the process educates its staff. 

Choosing the right GO term 

As emphasized above, ontology terms should be chosen 
based not on the term name, but on the definition of the 
term. Ontology terms can be explored using AmiGO (20), 
http://amigo.geneontology.org, or QuickGO (21), 
http://www.ebi.ac.uk/QuickGO/. Often it is hard to find the 
appropriate GO term using the description or phrases from the 
literature because GO terms can be more descriptive and 
they reflect the actual function or process rather than a 
gene product name or family name. Therefore, to assist in 
searching, and to accurately reflect the language of biology, 
many ontology terms are associated with synonyms, which 
are typically the terminology or language used in the litera- 
ture. For example, the phrase 'transcription repressor' is 
loosely used in the literature to refer to any transcription 
repressing role. This concept is represented in the GO as 
'negative regulation of transcription, DNA-dependent' 
[GO:0045892], and the phrase transcription repressor is a 
synonym of this term. Development of the ontologies (i.e., 
adding new terms, refining definitions) is an active process 
and if no GO term suitable to describe a 
gene product is available, biocurators are encouraged to 
request that a new term be added to the ontology. The GOC 
has set up several ways to handle new term requests and to 
evaluate existing terms. The easiest way is to contact the GO 
helpdesk (go-helpdesk@geneontology.org or 
http://www.geneontology.org/GO.contacts.shtml), providing as much 
detail as possible. 
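
The role of synonyms in term searching can be sketched in a few lines. The example below searches term names and synonyms for a phrase from the literature, using two terms mentioned in this guide. The `TERMS` structure and the function `find_terms` are hypothetical and stand in for what browsers such as AmiGO or QuickGO do far more thoroughly.

```python
# A tiny term table built from examples in this guide (Figure 1 and the text).
TERMS = {
    "GO:0004463": {
        "name": "leukotriene-A4 hydrolase activity",
        "synonyms": ["leukotriene A4 hydrolase activity",
                     "LTA4 hydrolase activity",
                     "LTA-4 hydrolase activity"],
    },
    "GO:0045892": {
        "name": "negative regulation of transcription, DNA-dependent",
        "synonyms": ["transcription repressor"],
    },
}


def find_terms(phrase, terms=TERMS):
    """Return GO IDs whose name or any synonym contains the phrase."""
    needle = phrase.casefold().strip()
    hits = []
    for goid, term in terms.items():
        labels = [term["name"], *term["synonyms"]]
        if any(needle in label.casefold() for label in labels):
            hits.append(goid)
    return hits


candidates = find_terms("transcription repressor")
```

A hit from such a search is only a candidate: the biocurator still has to read the term definition before annotating.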



What if nothing is known about 
the gene product? 

Typically after an organism's genome sequence is deter- 
mined, structural annotation is performed using computa- 
tional methods to make gene model predictions. Some of 
the resulting predicted genes will have been previously 
characterized and as a result will have literature-associated 
evidence or sequence based relationships to other well- 
defined genes. For other predicted genes neither experi- 
mental nor sequence based functions will be available. 
This represents sets of similar proteins that have yet to be 
characterized and proteins without similarity to any previ- 
ously characterized sequence. Thus no literature is available 
on which to base an annotation. In cases such as this where 
nothing can be gleaned from the literature, it is correct to 
associate the gene product to the most general terms in the 
three ontologies, 'molecular_function' (GO:0003674), 
'biological_process' (GO:0008150) and 'cellular_component' 
(GO:0005575) (called the root nodes, see Table 1) with the 
evidence code No Data (ND). It should be noted that anno- 
tating to the root node specifically states that an extensive 
search of the literature was conducted and no experimental 
results were found to indicate the function of this gene 
product. Since a biocurator infers that nothing has been 
published about the gene product, a custom reference 
(not a published article) that documents this curatorial 
procedure (the 'GO reference' GO_REF:0000015) should be 
included in an ND annotation. These ND annotations are 
used by projects such as SGD that have hunted through 
the published literature for reported functions of all gene 
products in the budding yeast. In this way the users can 
trust that a literature search did indeed occur. The use of 
ND is important because the absence of an annotation 
could mean that a function has been reported but no GO 
annotation has been captured or that there is no evidence 
available. Annotation projects should routinely explore any 
newly published works describing genes in their area of 
interest to determine if any new experimental results are 
available. Once new annotations have been defined, exist- 
ing ND annotations for that gene product should be 
removed. 
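
The bookkeeping around ND annotations can be sketched as follows. The function assumes the literature search has already been carried out by a biocurator, and it treats the three aspects independently, which is an assumption of this example. The aspect codes follow the GAF convention (F, P, C); the dictionary keys and the name `refresh_nd` are hypothetical.

```python
# Root node of each ontology, keyed by the GAF aspect code.
ROOT_TERMS = {
    "F": "GO:0003674",  # molecular_function
    "P": "GO:0008150",  # biological_process
    "C": "GO:0005575",  # cellular_component
}
ND_REFERENCE = "GO_REF:0000015"


def refresh_nd(annotations):
    """Rebuild the ND annotations for one gene product.

    Existing ND annotations are dropped; an aspect with no other
    annotation receives an ND annotation to its root node.
    """
    kept = [a for a in annotations if a["evidence_code"] != "ND"]
    covered = {a["aspect"] for a in kept}
    result = list(kept)
    for aspect, root in ROOT_TERMS.items():
        if aspect not in covered:
            result.append({
                "aspect": aspect,
                "go_id": root,
                "evidence_code": "ND",
                "reference": ND_REFERENCE,
            })
    return result
```

Calling the function again after new annotations have been added removes the ND annotation for the aspect that is now covered and leaves the others in place.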

It is especially important that biocurators make sure the 
results presented in the article fit all parts of the term def- 
initions; biocurators should not rely only on the term name. 
In the following, we present guidelines for commonly en- 
countered curation issues observed for the individual 
ontologies. 

Molecular function 

Molecular Function describes activities, such as catalytic, 
binding or transporter activities, at the molecular level 
(e.g. 'protein kinase activity' [GO:0004672], 
'6-phosphofructokinase activity' [GO:0003872], 'transcription 
factor binding' [GO:0008134], 'alanine transmembrane 
transporter activity' [GO:0022858]; see Table 1 for GOIDs and 
definitions for these GO terms). GO molecular function terms 
describe activities rather than the entities (complexes, 
gene products or molecules) that perform the actions. 
Typically direct assays such as enzyme kinetics measure- 
ments or binding studies can be used to infer molecular 
function annotations. In addition sequence comparison 
methods are often used to predict the molecular function 
of a gene product because functions are often associated 
with conserved protein domains (see Figure 2 to compare 
evidence from experimental and nonexperimental results). 

• Deciding between a Molecular Function and a 
Biological Process term takes practice. The key question 
to ask when selecting a Molecular Function term is 
whether the experimental results show 'how' the 
gene product accomplishes its role. For example if the 
result simply shows that a mutant version of a gene 
product affects transcription, by itself that doesn't 
show that the gene product is a transcription factor. 
If instead the study shows that transcription is modu- 
lated when the gene product binds to DNA or protein, 
then an appropriate Molecular Function term ('se- 
quence-specific DNA binding RNA polymerase II tran- 
scription factor activity' [GO:0000981] or one of the 
child terms of 'protein binding transcription factor ac- 
tivity' [GO:0000988]) would be correct. In contrast, data 
from a mutant phenotype experiment could be used to 
make a Biological Process annotation to the term, 'tran- 
scription, DNA-dependent' [GO:0006351] or to one of 
its child terms (see Table 1 for GOIDs and definitions). 

• Only GO terms that can be supported by the experi- 
mental results should be selected, based on the GO 
term definitions. For example, if the Introduction of 
an article states that a gene product is a transcription 
factor but only provides experimental results showing 
DNA binding, then this article is not appropriate for an 
experimentally based annotation to 'sequence-specific 
DNA binding RNA polymerase II transcription factor 
activity' [GO:0000981]. The appropriate term would be 
'sequence-specific DNA binding' [GO:0043565] (see 
Table 1) or a more specific DNA binding term. In 
another situation, if the authors show via sequence 
comparison methods that a protein is a 
serine/threonine/tyrosine kinase, but only show experimental 
evidence for phosphorylation of serine and threonine, 
the biocurator must only annotate to 'protein 
serine/threonine kinase activity' [GO:0004674] using an 
experimental evidence code (for example, Inferred from Mutant 
Phenotype or Inferred from Direct Assay, see Figure 2). 
The biocurator could add a separate annotation to the 
protein serine/threonine/tyrosine kinase activity term with 
the ISS evidence code (see below). These annotations thus 
indicate what was experimentally shown in an article and what 
was predicted from sequence comparison. 

• The Molecular Function ontology also contains terms 
that describe protein-protein interactions. However, 
annotating to such terms, e.g. 'protein binding' 
[GO:0005515], is done with careful consideration, as 
most proteins bind other proteins at one time or an- 
other. A rule of thumb is to determine whether the 
gene product being annotated is accomplishing a bio- 
logical purpose by binding to another protein: if so, 
protein binding could be one of its functions. If more 
specific information on the type of protein being 
bound is available then the annotation should be 
made to a more specific term. For example, if the 
gene product being annotated binds to a histone, 
then 'histone binding' [GO:0042393] is the appropriate 
term. 

• Many terms in the Molecular Function ontology implicitly 
or explicitly entail the binding of a chemical or 
protein. In these cases, it is unnecessary to co-annotate 
the binding of the substrates, cofactors or products, as 
the enzymatic activity is defined by the compounds 
being bound, if only in a transition state. For example, 
while annotating to terms like 'ATPase activity' 
[GO:0016887] it is implicit that the gene product binds 
to ATP and thus it is not necessary to annotate to both 
'ATPase activity' and 'ATP binding' [GO:0005524]. 
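
The co-annotation rule above can be sketched as a simple check over the terms assigned to one gene product. This Python sketch is illustrative only: the IMPLIED_BINDING table and the function name are hypothetical, and only the 'ATPase activity'/'ATP binding' pair is taken from the text.

```python
# Hypothetical lookup: activity term -> binding terms that it already implies.
# Only the ATPase/ATP pair comes from the guideline text.
IMPLIED_BINDING = {
    "GO:0016887": {"GO:0005524"},  # 'ATPase activity' implies 'ATP binding'
}


def redundant_binding_terms(go_ids):
    """Return binding terms already implied by another term in go_ids."""
    annotated = set(go_ids)
    redundant = set()
    for activity, implied in IMPLIED_BINDING.items():
        if activity in annotated:
            redundant |= implied & annotated
    return sorted(redundant)


print(redundant_binding_terms(["GO:0016887", "GO:0005524"]))
print(redundant_binding_terms(["GO:0005524"]))
```

A binding term is reported only when the activity that implies it is also present, so a gene product shown solely to bind ATP is not flagged.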



Biological process 

Biological Process describes biological goals accomplished 
by one or more ordered assemblies of molecular functions. 
A biological process is not equivalent to a pathway. 
Specifically it does not represent any of the dynamics or 
dependencies that would be required to describe a path- 
way. Examples of broad Biological Process terms include 
'metabolic process', 'signaling' and 'death'. High-level pro- 
cesses such as 'cell death' [GO:0008219] can have both sub- 
types, such as 'apoptotic process' [GO:0006915], and 
subprocesses, such as 'apoptotic chromosome condensa- 
tion' [GO:0030263] (see Table 1). Experiments describing 
the phenotypes of mutant genes, genetic interactions and 
some in vitro assays, can all be informative about the bio- 
logical processes in which a gene product participates 
(Figure 2). 

• On occasion when authors present experimental results 
for a gene product's role in a specific type of process, 
they then extrapolate to infer its role in other related 
processes. The annotations made from a given article 
should only be for the processes experimentally 
demonstrated in that study. For example, if the results 
show that a gene product can transport serine and 
threonine, but the authors extrapolate that the gene 
product can thus transport any amino acid, the gene 
product should be annotated only to 'serine transport' 
[GO:0032329] and 'threonine transport' [GO:0015826] 
and not to 'alanine transport' [GO:0032328], etc. 

Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054 

[Figure 2 flow chart: GO Evidence Code Decision Tree. Root 
question: what type of evidence is the annotation based on? 
Branches include non-experimental method, author statement 
from publication, and no evidence is available. Decision 
questions include: is a single gene being mutated or 
compared to other alleles of the same gene; is the 
annotation based on a genetic interaction with another 
gene, on a direct 1 to 1 physical interaction with another 
gene product, on a direct assay for the function, process 
or component of the gene product, or on the expression 
pattern of the gene product; and will each annotation be 
individually reviewed and confirmed by a human annotator? 
Legend: codes for curator-reviewed annotations and codes 
for annotations NOT reviewed by a curator.] 

Note on use of ND evidence code: 

Unlike the other evidence codes, the No Data (ND) code does not 
indicate evidence or a method from a specific reference. Rather, it 
indicates that the annotator looked at the available information and 
determined that nothing is known about the gene for a given aspect of 
GO (molecular function, biological process, or cellular component). 
The annotator will always look at all available literature for the gene. 
Depending on the resources and annotation philosophy of the 
annotating group, the annotator may also look at sequence 
comparison data to determine if any predictions may be made 
based on the sequence. 

Figure 2. GO Evidence code decision tree describing the process of choosing an evidence code. This flow chart is meant to orient 
the biocurator on the different categories of evidence codes and does not include the complete definitions of the evidence codes 
(Table 2). This chart will aid the biocurator to evaluate the reported method or results and map them to an appropriate evidence 
code; the biocurator should consult the detailed evidence code documentation available online from 
http://www.geneontology.org/GO.evidence.shtml. 

• Similar to the above example, if the results show a re- 
sponse to a variety of stress conditions, it is best to 
capture that data with the specific terms rather than 
annotating to a higher-level term. For example, the 
Saccharomyces cerevisiae gene HSP12 is annotated to 
specific terms 'cellular response to heat' [GO:0034605], 
'cellular response to osmotic stress' [GO:0071470] and 
'cellular response to oxidative stress' [GO:0034599] (22) 
rather than the high level 'cellular response to stress' 
[GO:0033554]. The use of grouping terms such as 'cellular 
response to stress' in direct annotations is discouraged 
because an experiment would typically not describe the 
response to a global stress, but would rather test the 
response to a specific type of stress. 

• Direct versus indirect effect. Many GO Biological Process 
annotations are assertions based upon mutant 
phenotypes. When annotating based upon mutant phenotype 
results, it can be difficult to discern whether a gene 
product is directly involved in the process for which the 
authors screened (assayed) or whether its absence instead 
results in an indirect or downstream effect. For example, 
if any of the S. cerevisiae proteins involved in 'RNA 
splicing' [GO:0008380] are mutated, translation is 
affected. This is a downstream effect: most of the genes 
encoding ribosomal proteins have introns (for example, 
the yeast ribosomal genes RPL2A, RPL2B, RPS11A and 
RPS11B), so if splicing genes are mutated, the 
transcripts of these ribosomal genes are not processed, 
thereby affecting ribosomal assembly and hence 
translation. In this case, the genes involved in splicing 
should not be annotated to 'translation' [GO:0006412]. 
Determining whether a mutant phenotype reflects a direct 
or indirect effect requires a general understanding of 
the gene products as well as the biological process under 
investigation. However, in cases where little is known 
about the gene product or process, or what is known is 
not easily reconciled with a mutant phenotype, it is the 
responsibility of the biocurator to accurately reflect 
the conclusions made from the available experiments. 
Such annotations should be revisited when new literature 
becomes available and should be replaced with a more 
specific term(s) if possible. 

• Annotating from gene or protein expression studies. 
There are many expression studies that measure the 
levels of RNA molecular species or protein levels when 
an organism or cell line is exposed to various stimuli. 
Conclusions from these experiments can suggest that 
the over-expressed genes or proteins are involved in 're- 
sponding to that stimulus'. However, overexpression 
does not necessarily imply that those genes or proteins 
are directly involved in the 'response to the stimulus' 
[GO:0050896]. The 'response to' GO terms are intended 
to annotate gene products that are required for the re- 
sponse to occur and are a direct result of the organism's 
reaction to the stimuli (e.g. production of a gene prod- 
uct used to degrade a toxin or signaling to initiate 
immune cell division in response to a parasite). If nothing 
else is known about the gene product, it is acceptable to 
annotate to a child of 'response to stimulus' using the 
IEP evidence code. If more is known about the regula- 
tion of the gene product, then that should be taken into 
account to make a decision about annotating to the 're- 
sponse to' term. It is acceptable to not annotate from 
such expression studies, since a change in the expression 
of a gene product does not in itself indicate its 
contribution to the function or process. Also, expression 
studies can 
seldom support annotations to a Cellular Component or 
Molecular Function term. Thus IEP should be used to 
annotate to terms in Biological Process only. 

• Annotating to regulation terms in Biological Process. 
Regulation of a biological process is defined as a role 
that modulates the frequency, rate or extent of that 
process. 

o To decide if the gene product participates directly in 
a process or regulates that process, the nature of the 
process should be studied carefully (Is there a defined 
pathway? Is it a biochemical pathway and have the 
gene products that perform the individual steps been 
identified? Does the gene product being annotated 
function within the pathway or outside of the path- 
way to start or stop or change the rate of the 
process?) 

o If it cannot be determined whether the gene product 
is involved in the process itself or instead in regula- 
tion of the process (this can happen if the process is 
not well defined), then biocurators should annotate 
to the parent process term. For example, if a mutant 
phenotype shows that a specific process is missing in 
an organism but the nature of the function of the 
gene product is unknown, an annotation should be 
made to the parent process term. Note that processes 
in GO are defined to reflect the predominant com- 
munity view with respect to what is included in the 
process and what is influencing or regulating the pro- 
cess externally. 



o Some gene products can be annotated to both a pro- 
cess and regulation of that process as in the case of 
positive and negative feedback loops. 
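
The restriction of IEP to Biological Process terms, stated in the expression-studies guideline above, lends itself to an automated check. The following Python sketch assumes the one-letter aspect codes used in GAF files (P, F and C); the function name and message text are hypothetical.

```python
# Assumed aspect codes, as used in GAF files:
# P = Biological Process, F = Molecular Function, C = Cellular Component.
def iep_aspect_problem(evidence_code, aspect):
    """Return a message if IEP is used outside Biological Process, else None."""
    if evidence_code == "IEP" and aspect != "P":
        return "IEP should be used only with Biological Process terms"
    return None


print(iep_aspect_problem("IEP", "P"))  # acceptable, prints None
print(iep_aspect_problem("IEP", "F"))  # flagged
```

The check says nothing about whether an IEP annotation is biologically justified; that judgment remains with the biocurator.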

Cellular component 

Cellular Component describes locations, at the levels of sub- 
cellular structures and macromolecular complexes. 
Experiments informing Cellular Component annotations in- 
clude fluorescence microscopy and co-fractionation of com- 
plex members. Examples of cellular components include 
'nuclear inner membrane' [GO:0005637], with the synonym 
'inner envelope', and the 'ubiquitin ligase complex' 
[GO:0000151] (see Table 1), with several subtypes of these 
complexes represented. 

• Care must be taken when interpreting a subcellular lo- 
cation, as certain tagged proteins may be mistargeted. 
For example, Huh et al. (23) list several yeast proteins 
that were mislocalized to the vacuole or other 
components upon addition of a molecular tag (see their 
Supplementary Table S2). 

• When a macromolecular complex has been character- 
ized, all subunits of the complex should be annotated 
to an appropriate complex term in the Cellular 
Component ontology (for example, 'spliceosomal complex' 
[GO:0005681] or 'nucleosome' [GO:0000786]). Depending 
on the nature of the experiment, annotation to a com- 
plex can either be made using the IDA evidence code or 
the IPI evidence code. For example, if an author purifies 
a complex and then investigates the constituent gene 
products, a curator would use the IDA evidence code 
for annotation. If the authors instead perform protein- 
binding assays to show that a gene product binds to 
other members of the complex, then the IPI evidence 
code should be used with appropriate targets included 
in the WITH/FROM column (see below). 

• There are several terms in the Cellular Component 
ontology in the format 'x part' (e.g. 'nuclear part' 
[GO:0044428]; 'membrane part' [GO:0044425] etc.). 
These terms were added to make the ontology is_a 
complete (i.e. ontologically correct). Without additional 
qualifiers, annotation to these terms conveys no more 
information than annotation to the parent terms. 
Hence, these terms should not be used in making 
manual annotations. 
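
The choice between IDA and IPI for complex membership, described above, can be summarized in a small helper. This is a sketch under stated assumptions: the experiment labels, the function name and the partner identifiers are hypothetical, and joining several WITH/FROM identifiers with '|' is an assumed convention rather than something stated in this guide.

```python
def complex_evidence(experiment, partners=()):
    """Pick an evidence code for a complex-membership annotation.

    experiment: 'purification' (complex purified, constituents identified)
                or 'binding_assay' (binding to other complex members shown).
    partners:   identifiers of the interacting gene products (needed for IPI).
    """
    if experiment == "purification":
        return {"evidence": "IDA", "with_from": ""}
    if experiment == "binding_assay":
        if not partners:
            raise ValueError("IPI needs interacting partners in WITH/FROM")
        # Assumption: multiple identifiers are joined with '|'.
        return {"evidence": "IPI", "with_from": "|".join(partners)}
    raise ValueError(f"unknown experiment type: {experiment!r}")


print(complex_evidence("purification"))
print(complex_evidence("binding_assay", ["DB:geneA", "DB:geneB"]))
```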

Additional information about the GO term (annotation extensions) 

Often, an article will contain more detailed information 
than existing GO terms can fully represent. In many such 
cases, biocurators may request that new, more specific 
terms be added to the ontology, but new GO terms may not 
always be the preferred solution. Rather, some 
information, such as the substrates of a protein kinase 
or the cell type in which a gene product has a particular 
localization, is best captured using annotation 
extensions (also referred to as 'column 16' after its 
position in the GAF, Table 3). Additional information 
captured in this column provides more biological context 
to the GO annotation. 

An annotation extension has two parts: an entity identi- 
fier for the object that is used to increase the specificity of 
the annotation (e.g. identifiers for a gene, gene product, 
GO term or a term from an external ontology such as a cell 
type or anatomy ontology), and a relation that connects 
the 'primary' GO term to the entity represented by the 
identifier. The information captured in GO annotation ex- 
tensions encompasses several types of effector-target 
relationships. 

• The substrates of a function such as the target of a 
protein kinase. For example, the S. pombe win1 
(SPAC1006.09) protein has been annotated to 'MAP 
kinase kinase kinase activity' [GO:0004709] with the 
extension 'has_direct_input (PomBase:wis1)', where the 
S. pombe protein wis1 is the substrate of win1. 

• Activators and inhibitors, using the relationships 
activated_by and inhibited_by. 

• Regulation targets of signaling pathways or 
transcription factors. For example, the S. pombe gene 
map1 is annotated to 'positive regulation of mating-type 
specific transcription from RNA polymerase II promoter' 
[GO:0001197] with the extension 'has_regulation_target 
(PomBase:SPMTR.02)' indicating that SPMTR.02/matPi is 
the target of the regulation event. 

• Spatial aspects of processes or localizations, as in a 
specific cell or tissue type as represented in the Cell 
Type Ontology (24), e.g. occurs_in [CL:0000182], where 
CL:0000182 identifies the cell type 'hepatocyte'. 

• Temporal aspects of a process or developmental stage, 
e.g. 'happens_during' for mitosis. For example, the 
S. pombe gene mug27 is annotated to 'septation 
initiation signaling cascade' [GO:0031028] with the 
extension 'happens_during meiotic cell cycle' 
[GO:0051321], implying that mug27 is involved in a 
septation initiation signaling cascade that happens 
during the meiotic cell cycle. 

An annotation may have one or more extensions, using the 
same or different relations. It is thus possible to capture 
multiple substrates of a kinase, for example. Compound 
extensions are also allowed, making it possible to indicate 
that two or more extensions apply simultaneously. For 
example, a gene product that is involved in a process only 
when it localizes to the nucleus, and only during S-phase 
of the cell cycle, can be annotated to a process term plus 
the extension 'occurs_in nucleus', 'during S phase of mitotic 
cell cycle'. A list of allowed relationships is available in 
the go_annotation_extension_relations.obo file 
(http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/ontology/extensions/go_annotation_extension_relations.obo), 
while the format for the various database identifiers can 
be found in the GO cross reference file 
(http://www.geneontology.org/doc/GO.xrf_abbs) (manuscript 
in preparation). 
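
The two-part structure of an annotation extension (a relation plus an entity identifier) can be illustrated with a short parser. This Python sketch rests on assumptions that are not stated in this guide: extensions are written as relation(identifier), '|' separates independent extensions and ',' joins extensions that apply simultaneously. The function name is hypothetical, and the second example string is a purely syntactic illustration that reuses identifiers quoted above, not a real annotation.

```python
import re

# One extension: relation(Entity), e.g. has_direct_input(PomBase:wis1).
_EXTENSION = re.compile(r"^\s*([A-Za-z_]+)\s*\(\s*([^()\s]+)\s*\)\s*$")


def parse_extensions(column16):
    """Parse an extension string into groups of (relation, entity) pairs.

    Assumed syntax: '|' separates independent extensions and ',' joins
    extensions that apply simultaneously (a compound extension).
    """
    groups = []
    for group in filter(None, column16.split("|")):
        parsed = []
        for item in group.split(","):
            match = _EXTENSION.match(item)
            if match is None:
                raise ValueError(f"malformed extension: {item!r}")
            parsed.append((match.group(1), match.group(2)))
        groups.append(parsed)
    return groups


print(parse_extensions("has_direct_input(PomBase:wis1)"))
# Syntactic illustration only: one independent and one compound extension.
print(parse_extensions(
    "has_direct_input(PomBase:wis1)|occurs_in(CL:0000182),happens_during(GO:0051321)"
))
```

Each inner list is one extension group, so a compound extension appears as a list with more than one pair. Checking relations against the list of allowed relationships is left out of the sketch.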

Choice of evidence code 

Four different categories of evidence codes are available 
for manual curation: experimental, computational analysis, 
author statements and curatorial statements (details in 
Table 2, Figure 2). 

• Use of an experimental evidence code indicates that 
the cited article reported results that support the asso- 
ciation of a GO term from characterization of a gene or 
gene product. 

• Evidence codes in the computational analysis category 
imply that the annotation was inferred based on 
in silico analysis of the gene or gene product sequence 
and/or other data as cited in the reference. 
Biocurators can also perform in silico analysis, inde- 
pendent of a published article, to infer an annotation, 
in which case a GO Reference (GO_REF) that describes 
the methods used by the biocurator is used as 
reference. 

• Author statements include assertions made anywhere in 
the cited article, including the Introduction and 
Discussion. These evidence codes were made available 
by the GOC because, during the initial stages of the 
project, curation of such statements was an easy way 
to obtain a good volume of annotations quickly. However, 
annotations using these evidence codes are now being 
replaced by those citing direct evidence. Use of author 
statement codes is discouraged and so they are not 
described in detail here. 

• Curatorial statements indicate that the biocurator 
reviewed the information and made the appropriate 
annotation decision. IC and ND are curatorial statement 
codes. The ND evidence code, which has been described 
earlier in the article, is used to indicate that no 
biological data are available to infer any GO term for 
that gene product. The IC evidence code can be used in 
two different scenarios. The first case includes those 
instances where an annotation is not supported by any 
direct evidence, but can be reasonably inferred by a 
biocurator from other GO annotations for which evidence 
is available. For example, if a gene product is shown 
experimentally to have the function 'sequence-specific 
DNA binding RNA polymerase II transcription factor 
activity' [GO:0000981], and there is no direct evidence 
for the cellular location of the gene product, then it is 
within general knowledge that this function takes place 
in the nucleus and thus the biocurator can infer the gene 
product's location. Both annotations will use the same 
published article as reference and, in addition, the IC 
annotation will include the GO ID used by the biocurator 
for the inference in the FROM column (column 8 in the GAF 
2.0 file; Table 3). In the second case, a curator infers 
an annotation from multiple sources of evidence or 
multiple GO annotations, as described below. 
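
The four categories of evidence codes can be represented as a lookup table. The Python sketch below is illustrative: the assignment of individual codes to categories follows the public GO evidence code documentation as understood here, not a list given in this section, and should be checked against Table 2. Codes that appear nowhere in this section (EXP, ISM, IGC, RCA, TAS, NAS) are included on that assumption, and the function name is hypothetical.

```python
# Assumed grouping of manual evidence codes; verify against Table 2.
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curatorial statement": {"IC", "ND"},
}


def evidence_category(code):
    """Return the category name for a manual evidence code."""
    for category, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return category
    raise KeyError(f"not a recognized manual evidence code: {code}")


print(evidence_category("IDA"))
print(evidence_category("IC"))
```

IEA is deliberately absent from the table because it marks annotations that are not reviewed by a curator.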

Data supporting the evidence code 

In addition to the evidence code that reflects the type of 
experiment leading to an annotation, the GOC provides 
two ways to capture additional evidence information for 
an annotation: the qualifier and the WITH/FROM column. 
A qualifier can be used to augment the interpretation of 
the GO term. Three qualifiers are available: 
colocalizes_with, contributes_to and NOT. These are found in the 
QUALIFIER column of the GAF 2.0 format (Table 3). 

QUALIFIER 

• Sometimes, gene products are transiently or peripher- 
ally associated with an organelle or complex. These re- 
sults can be annotated to the relevant Cellular 
Component term along with the colocalizes_with quali- 
fier. The colocalizes_with qualifier can be used only 
with the Cellular Component ontology. For example, 
the S. pombe protein clp1 is a nucleolar protein but 
transiently associates itself with the 'actomyosin 
contractile ring' [GO:0005826] (25). Hence clp1 is annotated 
to this term with the colocalizes_with qualifier. 

• The contributes_to qualifier can be used only with 
Molecular Function terms. Sometimes complexes are 
shown to have an activity, but the activity of each 
subunit is not shown. In such cases, individual subunits 
that are part of a complex can be annotated to terms that 
describe the function of the complex. If the activity of 
the complex is associated with a single subunit and the 
other subunits serve either as regulatory subunits or to 
keep the complex together, then the subunits should 
be annotated to those specific activities. Contributes_to 
is not needed to annotate a catalytic subunit. 
Furthermore, contributes_to may be used for any 
noncatalytic subunit, whether the subunit is essential 
for the activity of the complex or not. In another 
usage, if two or more subunits of a complex are 
required for the catalytic activity of the complex, then 
all those subunits get annotated to the corresponding 
Molecular Function term with the contributes_to 
qualifier. The gene products annotated to function terms 
with the contributes_to qualifier should also be 
annotated to the term in the Cellular Component 
ontology for the complex that has that molecular 
function. For example, the subunits of the S. cerevisiae 
mitochondrial respiratory chain complex III are all 
annotated to the Molecular Function term 
'ubiquinol-cytochrome-c reductase activity' [GO:0008121] 
with the contributes_to qualifier (26) and to the complex 
term 'mitochondrial respiratory chain complex III' 
[GO:0005750] in the Cellular Component ontology. This 
qualifier is not used with terms in the Biological 
Process ontology because a biological process is a 
collection of molecular events and by default gene 
products contribute to the whole process. 

• The negative of a GO term, the NOT qualifier. This 
qualifier is used to explicitly denote that the gene prod- 
uct is not associated with the function, process or com- 
ponent represented by the GO term. This qualifier is 
used when a gene product is expected to have a func- 
tion, but has been shown experimentally not to have 
the enzymatic activity; in this case the gene product can 
be annotated as NOT. For example, the NOT qualifier is 
used to indicate that the Caenorhabditis elegans gene 
C42C1.11a.2 was experimentally shown to NOT have 
'leukotriene-A4 hydrolase activity' [GO:0004463] despite 
strong homology to the human leukotriene A4 hydro- 
lase (27). Annotations that use the NOT qualifier can be 
particularly informative for evolutionary studies that 
wish to explore the gain and/or loss of gene product 
activity. 
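
The aspect restrictions on the three qualifiers can be expressed as a small validation routine. This Python sketch assumes the one-letter GAF aspect codes (P, F, C) and assumes that several qualifiers in one field are separated by '|'; the table and function names are hypothetical.

```python
# Which ontology aspects each qualifier may be combined with, per the text:
# colocalizes_with -> Cellular Component only; contributes_to -> Molecular
# Function only; NOT -> any aspect.
ALLOWED_ASPECTS = {
    "colocalizes_with": {"C"},
    "contributes_to": {"F"},
    "NOT": {"P", "F", "C"},
}


def qualifier_problems(qualifier_field, aspect):
    """Return a list of problems for a QUALIFIER value (empty if none)."""
    problems = []
    # Assumption: multiple qualifiers are separated by '|'.
    for qualifier in filter(None, qualifier_field.split("|")):
        allowed = ALLOWED_ASPECTS.get(qualifier)
        if allowed is None:
            problems.append(f"unknown qualifier: {qualifier}")
        elif aspect not in allowed:
            problems.append(f"{qualifier} cannot be used with aspect {aspect}")
    return problems


print(qualifier_problems("colocalizes_with", "C"))
print(qualifier_problems("contributes_to", "P"))
```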

WITH/FROM column 

• The WITH column is required for Inferred from 
Electronic Annotation (IEA), IGI, IPI, ISS, Inferred from 
Sequence Alignment (ISA) and Inferred from Sequence 
Orthology (ISO) codes (Table 2). 

• For example, when using ISS, the WITH column should 
be used to indicate the identifier of the gene product 
used for the sequence or structural comparison. For 
annotations based on sequence comparisons, it is im- 
portant to confirm that the protein used for the 
sequence comparison was experimentally verified to 
have that function and has a GO annotation reflecting 
that experimental finding. If a GO annotation is 
missing please report this to the GO consortium 
(go-helpdesk@geneontology.org). 

• Likewise, for IPI and IGI codes, the WITH column should 
be used to indicate the interacting gene product or 
gene respectively. Multiple identifiers can be entered 
in this field. 

• The FROM value is used to provide supporting 
information for the IC evidence code. For example, if a 
Molecular Function annotation is made to 
'sequence-specific DNA binding RNA polymerase II 
transcription factor activity' [GO:0000981] with 
experimental evidence, and a biocurator deduces that the 
gene product thus resides in the nucleus, then the 
component annotation to nucleus is made with the FROM 
value GO:0000981. 

• In many cases a GO term can be inferred from just one 
other annotation, but occasionally a curator can also 
infer an annotation to a term from multiple sources of 
evidence or multiple GO annotations. The FROM value in 
these annotations will therefore supply more than one GO 
identifier, obtained from the set of supporting GO 
annotations assigned to the same gene/gene product 
identifier, which cite publicly available references, 
and the annotation would have an unpublished GO 
reference (GO_REF:0000036) in its Reference field. 
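
The WITH/FROM requirements listed above can be checked mechanically before annotations are submitted. The Python sketch below is a minimal illustration: the function name is hypothetical, the separators '|' and ',' between identifiers are assumed, and the seven-digit pattern for GO identifiers is an assumption based on the identifiers quoted in this guide.

```python
import re

# Evidence codes that need a WITH/FROM value according to the text above.
REQUIRES_WITH_FROM = {"IEA", "IGI", "IPI", "ISS", "ISA", "ISO", "IC"}


def with_from_problems(evidence_code, with_from):
    """Return a list of problems with a WITH/FROM value (empty if none)."""
    # Assumption: identifiers are separated by '|' or ','.
    values = [v.strip() for v in re.split(r"[|,]", with_from) if v.strip()]
    problems = []
    if evidence_code in REQUIRES_WITH_FROM and not values:
        problems.append(f"{evidence_code} requires a WITH/FROM value")
    if evidence_code == "IC":
        # For IC, FROM holds the GO identifier(s) used for the inference.
        problems += [
            f"IC expects GO identifiers, got {value}"
            for value in values
            if not re.fullmatch(r"GO:\d{7}", value)
        ]
    return problems


print(with_from_problems("IC", "GO:0000981"))
print(with_from_problems("ISS", ""))
```

The sketch does not verify that an identifier given for ISS refers to a gene product with an experimentally supported annotation; that confirmation still has to be made by the biocurator.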

Suggested reading 

For examples of how GO annotations have been developed 
and how these guidelines have been put into practice 
please consult the following articles. The work on biofilm 
and filamentous growth in Candida (28), heart develop- 
ment (29), a case study of focused curation for renal and 
cardiovascular research (30) and in depth curation of the 
peroxisome proteome in humans (31) will be instructive for 
learning about curation of the literature to create GO 
annotations. 

Conclusions 

The goal of the GOC is the unification of biology by creat- 
ing a nomenclature used for describing the functional char- 
acteristics of any gene product, protein or RNA, from any 
organism. The GOC provides the research community a 
comprehensive resource of functional information on 
gene products. Toward this end, the GOC provides ontolo- 
gies, guidelines to make the gene product-to-GO term as- 
sociations and standardized formats to publish these 
annotations. This guide describes the methods used to 
create one of the two types of annotations that can be 
made with GO terms: manual curation. Consistency of GO 
annotations is paramount to ensure the quality of any 
analysis using the annotations. An understanding of the 
requirements and strategies associated with the three 
aspects of the GO, together with those of the different 
evidence codes, can help ensure that manual annotations 
are an accurate representation of the published results. 
Our hope is that 
these guidelines will provide encouragement and assist- 
ance to researchers to annotate their favorite gene prod- 
ucts, enriching both the quality and quantity of GO 
annotations available via the GOC. 